1. INTRODUCTION

Data mining and knowledge discovery in databases have been active areas of research in recent years [1].
Data mining applications are useful in both commercial and scientific settings [2]. In healthcare, data mining is an
important method that can be used to detect unknown diseases [3] and identify effective treatments [4].
Data mining techniques can be classified into two categories: supervised and unsupervised learning [5].
Supervised learning uses labeled datasets, while unsupervised learning finds patterns in unlabeled
datasets.

Clustering is an unsupervised learning technique. In data mining, clustering is one of
the most widely used fundamental tasks [6], and it is used to detect hidden structure or to outline the categories in data [7].
Clustering aims to find groups in unlabeled data such that similar data objects fall within the same cluster
while dissimilar data objects fall in different clusters [8]. Other studies use overlapping clustering, in which data
objects can belong to multiple clusters.

According to the study in [9], most real-world datasets contain overlapping information.
Overlapping clustering has been used in many applications, from wireless sensor networks [10] to social
network interactions [11]. For example, in social network analysis, overlapping clustering is used to detect
actors that can belong to multiple communities [12]. Agglomerative hierarchical clustering is another method
used to detect overlapping communities in a mobile network [13]. In wireless sensor networks, an energy-efficient
adaptive overlapping clustering method was proposed to improve energy efficiency for dynamic
continuous monitoring [14]. An algorithm called OverCite detects overlapping communities of
authors, papers, and venues simultaneously using the publication hypergraph and citation network
information [15]. Another study in network analysis proposed a density-based link clustering algorithm
to improve the accuracy of detecting overlapping communities in networks [16].

However, one of the many challenging issues is noisy or anomalous data, also known as outliers.
Outliers in a dataset may lead to inaccurate analysis [17], produce misleading statistical
results, and potentially decrease the quality of a data analysis task. For this reason, outlier detection is an
important data analysis task whose main objective is to detect anomalous or abnormal data in a given
dataset [18].

In this regard, the study focuses on the development of an overlapping clustering application
(OCA) that can identify both overlapping clusters and outliers. The study considered different
research methods and algorithms for the development of the application. One of the algorithms used is the
k-means algorithm, chosen for its simplicity in solving known clustering problems. The study also uses the
median absolute deviation (MAD), which is known to be one of the most robust measures of dispersion and is easy to
use in the presence of outliers. Maximum distance (maxdist) is the third method; it is used to identify data
objects that can be assigned to multiple clusters. OCA is limited to handling numerical data.


2. RESEARCH METHOD 
2.1. Operational Framework 

This section describes the workflow of OCA, as shown in Figure 1. OCA requires
that the data be converted into a standard spreadsheet and saved in CSV format before
loading. OCA then checks the dataset for outliers, and any identified
outliers are removed from the dataset. Next, the data objects are clustered. Afterwards, the clusters are
checked for data objects that overlap between clusters. Finally, the result of the data analysis process is
summarized and made available for interpretation.
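
The loading step can be sketched as follows. This is a minimal illustration, not the OCA implementation: it assumes the CSV file has a header row and only numerical columns, and the helper name `load_numeric_csv` is hypothetical.

```python
import csv
import io


def load_numeric_csv(handle):
    """Read a CSV with a header row; return (column_names, rows_of_floats).

    Hypothetical helper: it rejects non-numeric cells, mirroring the stated
    limitation that only numerical data is handled.
    """
    reader = csv.reader(handle)
    header = next(reader)
    rows = []
    for line_no, record in enumerate(reader, start=2):
        if not record:  # skip blank lines
            continue
        try:
            rows.append([float(cell) for cell in record])
        except ValueError as exc:
            raise ValueError(f"non-numeric value on line {line_no}") from exc
    return header, rows


# Illustrative in-memory data standing in for a file opened with open(path, newline="").
sample = io.StringIO("x,y\n1.0,2.0\n3.5,4.25\n")
names, data = load_numeric_csv(sample)
print(names, data)
```

The returned rows can then be passed to the outlier detection, clustering, and overlap steps in that order.
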


2.2. Main Function of OCA
The main function of the OCA application consists of three (3) phases.


2.2.1. Phase 1: Outlier Detection using Median Absolute Deviation 

Outlier detection aims to find patterns in data that do not conform to expected behavior [19].
Detecting and removing outliers is very important in data mining [20] because it may greatly enhance the
performance of statistical techniques and data mining algorithms [21]. To detect and remove the
outliers in the datasets, this study uses the median absolute deviation (MAD) [22]. The MAD procedure
is described below.

To calculate the MAD, all data objects are collected and ranked in ascending order.
The median of the series is then calculated, and this
median is subtracted from each data object to obtain the absolute deviations from the median.
These absolute deviations are sorted in ascending order to determine their median. Finally, this
median is multiplied by b to get the MAD value, where b=1.4826 [23]. Equation (1) shows the MAD formula,
where x_j are the original data objects and M denotes the median of a series.

\mathrm{MAD} = b \, M_i\left( \left| x_i - M_j(x_j) \right| \right) \qquad (1)


To determine the outliers, a criterion is computed as the median plus or minus a threshold value
(2, 2.5, or 3) times the MAD. By default, a
threshold value of 2.5 is recommended as a reasonable choice for outlier detection. Equation (2) shows the corresponding criterion values.


M + 2.5 * MAD or M - 2.5 * MAD (2)


All values less than the lower criterion (M - 2.5 * MAD) or greater than the upper criterion (M + 2.5 * MAD) are considered outliers. These outliers are
removed from the dataset before the data objects are partitioned into clusters.
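
The MAD criterion described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the OCA source: it treats a single numerical attribute at a time (how OCA combines several attributes is not specified here), the function names `mad` and `find_outliers` are hypothetical, and the sample values are purely illustrative.

```python
import statistics

B = 1.4826  # constant b from the text


def mad(values, b=B):
    """Scaled median absolute deviation: b * median(|x_i - median(x)|)."""
    m = statistics.median(values)
    deviations = [abs(x - m) for x in values]
    return b * statistics.median(deviations)


def find_outliers(values, threshold=2.5, b=B):
    """Return values outside median +/- threshold * MAD (assumed per attribute)."""
    m = statistics.median(values)
    spread = mad(values, b)
    lower = m - threshold * spread
    upper = m + threshold * spread
    return [x for x in values if x < lower or x > upper]


# Illustrative values only: the median is 7 and the median absolute deviation is 3.5.
data = [1, 3, 3, 6, 8, 10, 10, 1000]
print(mad(data))
print(find_outliers(data))
```

Because both the center and the spread are medians, the single extreme value barely moves the criterion, which is the robustness property that motivates using the MAD instead of the mean and standard deviation.
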


2.2.2. Phase 2: Clustering Using K-Means Algorithm 

K-means is one of the oldest and most popular clustering techniques [24]. It is easy to implement
and to apply, even on large datasets [25]. This section describes how the k-means
algorithm works.

First, the user enters the number of clusters k, and the algorithm randomly initializes one
centroid for each cluster. Then, the algorithm calculates the distance from every data object to the initial
centroids using the Euclidean distance. Each data object is assigned to its nearest cluster centroid, and the cluster
centroids are then recalculated. This process iterates until the assignments of data objects no longer change. Equation (3)
shows the Euclidean distance formula [26].


d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \qquad (3)
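
The k-means loop described above can be sketched as follows. This is a minimal sketch rather than the OCA implementation: it assumes the initial centroids are drawn at random from the data objects, and the optional `init` parameter is a hypothetical convenience added only to make the example reproducible.

```python
import math
import random


def kmeans(points, k, init=None, seed=None, max_iter=100):
    """Plain k-means on a list of coordinate tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: random initialization picks k distinct data objects as centroids.
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance, as in (3).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments no longer change: stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned objects.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2, init=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

With random initialization (for example `kmeans(points, k=2, seed=1)`), the result can depend on which data objects are drawn as initial centroids.
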


2.2.3. Phase 3: Overlapping Clustering Using Maxdist 

This section explains how objects are assigned to multiple clusters using maxdist [27]. After
the clusters are formed by the k-means algorithm, the calculated distance of each data object to the centroid of its assigned
cluster is saved. The maxdist (maximum distance of an object allowed in a cluster) recorded for each
cluster is used as the threshold for identifying objects that can belong to more than one cluster. Then,
the distance from each data object to the final centroid of every other
cluster is calculated. This distance is compared with the maxdist of that other cluster. If the
distance is less than the maxdist, the data object is identified as a pattern that overlaps with the other
cluster.

Figure 2 shows an example of data with three clusters. To determine whether data object
x1 in Cluster 1 overlaps with Cluster 2, the distance from x1 to the final centroid
(cent 2) of Cluster 2 is calculated. Then, the computed distance is compared with the distance of data object y3, which is equal to
the maxdist of Cluster 2. If the distance of x1 is less than the maxdist, then x1 is considered a data object that
overlaps with Cluster 2. This check is repeated for all the other clusters.

Figure 2. Identification of overlapping patterns
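
The maxdist check can be sketched as follows. This is one reading of the procedure, not the OCA source: it assumes the maxdist of a cluster is the largest distance from any of its members to its final centroid, and the function name `find_overlaps` and the sample points are hypothetical.

```python
import math


def find_overlaps(points, labels, centroids):
    """Return (maxdist per cluster, {point index: other clusters it overlaps})."""
    k = len(centroids)
    # maxdist[c]: largest distance from a member of cluster c to its final centroid.
    maxdist = [0.0] * k
    for p, lab in zip(points, labels):
        maxdist[lab] = max(maxdist[lab], math.dist(p, centroids[lab]))
    overlaps = {}
    for i, (p, lab) in enumerate(zip(points, labels)):
        # A point overlaps cluster c if it lies closer to c's centroid than maxdist[c].
        others = [
            c for c in range(k)
            if c != lab and math.dist(p, centroids[c]) < maxdist[c]
        ]
        if others:
            overlaps[i] = others
    return maxdist, overlaps


# Illustrative data: a compact cluster 0 next to a widely spread cluster 1.
points = [(0, 0), (2, 0), (4, 0), (7, 0), (10, 0), (10, 6), (10, -6)]
labels = [0, 0, 0, 1, 1, 1, 1]
centroids = [(2.0, 0.0), (9.25, 0.0)]
maxdist, overlaps = find_overlaps(points, labels, centroids)
print(maxdist)
print(overlaps)
```

In this example only the point (4, 0) is flagged: it belongs to cluster 0, but its distance to the centroid of cluster 1 (5.25) is smaller than the maxdist of cluster 1, so it is treated as overlapping with cluster 1.
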


2.3. Visualization Result

The clustering results can be visualized in OCA in a 2-dimensional graph.
Data objects are drawn as colored circular points, with a randomly
assigned color representing each cluster assignment. Red circular points signify
outliers identified in the dataset. Points that overlap from one cluster to another are marked with a
dark border. Each cluster centroid is marked with a plus sign (+). The Python programming
language was used for the development of OCA. Figure 3 shows an example of a visualization result
window. Figure 4 shows the result window that displays the detailed information of the data analysis
process.



Figure 3. Visualization result
Figure 4. Detailed information result


The detailed information window reports the number of instances and attributes, the number of data objects assigned to each
cluster, the identified outliers, and the overlapping assignments of data objects.


3. RESULTS AND ANALYSIS 
In this section, experiments are conducted to test the developed OCA. The application was
tested on two datasets: a synthetic dataset and a real dataset.


3.1. Experiment 1 

The first experiment used a synthetic dataset composed of two numerical attributes
and 327 instances: 322 instances are normal
data and 5 instances were introduced to serve as outliers.

The data objects are plotted in the 2-dimensional space provided by OCA, as shown in Figure 5.
First, OCA uses MAD to identify outliers in the dataset. Figure 6 shows the visualization
result in which the outliers identified by OCA are drawn as red circular points.

Figure 5. Synthetic datasets scatter plot
Figure 6. OCA detected outliers

As shown in the visualization result, OCA correctly identified all five (5) outliers in the dataset.
Figure 7 shows the dataset after the outliers are removed.

For the clustering process, the user determines the number of clusters k; k=4 was used
in this study. OCA takes 4 initial cluster centers, and each data object is assigned to
its nearest cluster center. The clusters in the 2-dimensional data space are marked with randomly assigned colors
that represent the primary cluster of each data object, as shown in Figure 8.

Figure 7. Scatter plot of datasets without outliers
Figure 8. OCA clusters assignment result


Then, OCA uses the maxdist of each cluster to assign data objects to multiple clusters. As shown in
Figure 9, circular points marked with a dark border are data objects that can belong to other clusters.
Finally, Tables 1-3 give the detailed results of the experiment on the synthetic dataset.

Figure 9. Simulation of data objects that overlap within clusters


Table 1. Implementation Result of Detected Outliers

No. of data objects    No. of outliers    Found outliers
327                    5                  5


Table 2. Implementation Result of Clustered Data

Clusters    No. of data objects
C0          153
C1          48
C2          25
C3          96


Table 3. Implementation Result of Overlap Clusters

Clusters    Overlaps with C0    Overlaps with C1    Overlaps with C2    Overlaps with C3
C0          -                   0                   0                   13
C1          0                   -                   0                   0
C2          0                   1                   -                   0
C3          45                  0                   0                   -


The results of Experiment 1 show that OCA accurately identified the outliers in the dataset and was able to
discover overlapping patterns.


3.2. Experiment 2 

In this section, a real dataset was obtained from the UCI Machine Learning Repository:
the Iris plants dataset, which has 150 instances with 4 numerical attributes (sepal length, sepal width, petal length, petal width).
Figure 10 shows the visualization result for the sepal width and sepal length attributes.
Figure 11 shows the outliers in the Iris dataset that were identified by OCA.

Figure 10. Iris datasets scatter plot under sepal width and sepal length attributes
Figure 11. Outliers under sepal width and sepal length


These identified outliers are removed from the dataset. Then, OCA takes k as input from the
user, where k=3 clusters, as shown in Figure 12. Finally, the overlapping patterns are identified, as shown in
Figure 13.

Figure 12. OCA clusters assignment result under sepal width and sepal length
Figure 13. OCA clusters assignment result under sepal width and sepal length


Another experiment was conducted using the petal width and petal length attributes
of the Iris dataset. The simulated results are shown in Figure 14.

Figure 14. Simulation result under petal width and petal length attributes


Table 4 gives the detailed results of the experiment using the four (4) numerical
attributes (sepal width, sepal length, petal width, petal length) of the Iris plants dataset. The experimental
results show that OCA found a total of 6 outliers out of 150 instances under sepal width and sepal length,
and none under petal width and petal length. For the identification of overlapping patterns under sepal width
and sepal length, a total of 77 patterns out of 150 instances were identified as overlapping between clusters, while
under petal width and petal length a total of 34 patterns were identified.


Table 4. Implementation Result Under Iris Plant Dataset

Attributes                Outliers    Clusters    No. of data objects    Overlap with C0    Overlap with C1    Overlap with C2
Sepal width and length    6           C0          49                     -                  38                 15
                                      C1          50                     0                  -                  0
                                      C2          45                     24                 0                  -
Petal width and length    0           C0          63                     -                  0                  17
                                      C1          50                     0                  -                  0
                                      C2          37                     17                 0                  -


Based on the above results, the developed OCA demonstrated its capability to
identify overlapping clusters and outliers.


4. CONCLUSION AND FUTURE WORKS 

The study presented an overlapping clustering application (OCA) for data analysis. Based on the
experimental results, the developed OCA demonstrated its capability to detect abnormal
values (outliers) and to identify clusters with overlaps. OCA is a useful data analysis tool for outlier
detection, data clustering, and detection of overlapping clusters.

Although the results are good, more tests are recommended. The developed
OCA only works with numerical datasets; therefore, modifying the application to handle other data types can be considered for
future work. Furthermore, it is recommended that an alternative approach that is not sensitive to the
random initialization of cluster centers be considered in a future study.